Using Envision's Automatic Hand Gesture Detection PyPi package (envisionhgdetector)¶


Wim Pouw (wim.pouw@donders.ru.nl), Bosco Yung, Sharjeel Shaikh, James Trujillo, Gerard de Melo, Babajide Owoyele (Babajide.Owoyele@hpi.de)

Info¶

In the following notebook, we are going to use an envisionbox Python package called "envisionhgdetector". It contains functions to automatically annotate gestures, perform kinematic analysis, and produce a visualization dashboard. In another envisionbox module on training a gesture classifier, we showed an end-to-end pipeline for training a model on particular human behaviors (e.g., head nodding, clapping) and then producing inferences on new videos. We have also shown how to do DTW analyses for exploring gesture similarity embedding spaces, and we have introduced dashboards for visualizing gestures alongside static data. This package builds on that work.

First, we trained a convolutional neural network to differentiate between no gesture, self-adaptors (or non-communicative "move" actions), and co-speech gestures. We did this based on the SAGA dataset, the Zhubo dataset, and the TED M3D dataset. Since the model was trained on some variability in datasets and camera angles, and on more than 9000 gestures, we can apply this gesture detector in somewhat more varied settings than would be possible had we trained on a single dataset.

Now, don't get too excited! The performance is not extraordinary, and it still awaits proper testing and further updates with better-trained models (we are working on it...). It also does not currently differentiate types of gestures (as far as that is possible; we are working on it...). But it is good enough, for some purposes, to make a quick pass over a set of videos and pull out the prominent gestures. Once we have the gestures, we can do all kinds of other interesting things, e.g., generate gesture kinematic statistics or gesture networks. And now all automatically!

Preprint (FORTHCOMING)¶

For more information we will refer to a preprint here; it is still under development.

Package info¶

https://pypi.org/project/envisionhgdetector/

What does envisionhgdetector do¶

  • It tracks upper-body, hand, and face landmarks (generating 29 features)
  • It makes an inference, based on 25 frames of data, whether to label no gesture (the default implicit label), a gesture (label: Gesture), or some kind of movement that is not a gesture (label: Move)
  • It outputs a labeled video, an ELAN file, a confidence timeseries, and a gesture segment list (with labels and start and end times)
  • It then allows you to retrack the gestures with metric pose world landmarks, which lets us compare gestures as their positions are given in meters rather than pixels
  • It then computes kinematic features and DTW distances between the gestures
  • Finally, it creates a visualization dashboard that summarizes the gesture kinematics

Installation¶

It is best to install in a conda environment.

conda create -n envision python=3.9

conda activate envision

Then proceed:

pip install -r requirements.txt
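Alternatively, if you are not working from the tutorial repository, the released package can presumably be installed directly from PyPI (the package page is linked above):

```shell
# Install the released envisionhgdetector package from PyPI
# (run inside the activated envision environment)
pip install envisionhgdetector
```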

Errors?¶

  1. Make sure you have the C++ redistributables installed for TensorFlow to work: https://learn.microsoft.com/en-us/cpp/windows/latest-supported-vc-redist?view=msvc-170#latest-microsoft-visual-c-redistributable-version

Citation¶

If you use this package, please cite:

  • Pouw, W., Yung, B., Shaikh, S., Trujillo, J., de Melo, G., & Owoyele, B. (2024). envisionhgdetector: Hand Gesture Detection Using a Convolutional Neural Network (Version 0.0.5.2) [Computer software]. https://pypi.org/project/envisionhgdetector/

Citations for the packages and datasets, and work that this builds on¶

Original Noddingpigeon Training code:

  • Yung, B. (2022). Nodding Pigeon (Version 0.6.0) [Computer software]. https://github.com/bhky/nodding-pigeon

Zhubo dataset (used for training):

  • Bao, Y., Weng, D., & Gao, N. (2024). Editable Co-Speech Gesture Synthesis Enhanced with Individual Representative Gestures. Electronics, 13(16), 3315.

SAGA dataset (used for training):

  • Lücking, A., Bergmann, K., Hahn, F., Kopp, S., & Rieser, H. (2010). The Bielefeld speech and gesture alignment corpus (SaGA). In LREC 2010 workshop: Multimodal corpora–advances in capturing, coding and analyzing multimodality.

TED M3D:

  • Rohrer, P. (2022). A temporal and pragmatic analysis of gesture-speech association: A corpus-based approach using the novel MultiModal MultiDimensional (M3D) labeling system [Doctoral dissertation, Nantes Université; Universitat Pompeu Fabra].

MediaPipe:

  • Lugaresi, C., Tang, J., Nash, H., McClanahan, C., Uboweja, E., Hays, M., ... & Grundmann, M. (2019). MediaPipe: A framework for building perception pipelines. arXiv preprint arXiv:1906.08172.

DTW:

  • Giorgino, T. (2009). Computing and visualizing dynamic time warping alignments in R: the dtw package. Journal of statistical Software, 31, 1-24.

Soft-DTW:

  • Cuturi, M., & Blondel, M. (2017, July). Soft-dtw: a differentiable loss function for time-series. In International conference on machine learning (pp. 894-903). PMLR.

Creating ELAN files (and a good paper on gesture classification):

  • Ienaga, N., Cravotta, A., Terayama, K., Scotney, B. W., Saito, H., & Busa, M. G. (2022). Semi-automation of gesture annotation by machine learning and human collaboration. Language Resources and Evaluation, 56(3), 673-700.

Kinematic features & DTW:

  • Trujillo, J. P., Vaitonyte, J., Simanova, I., & Özyürek, A. (2019). Toward the markerless and automatic analysis of kinematic features: A toolkit for gesture and movement research. Behavior Research Methods, 51, 769-777.
  • Pouw, W., & Dixon, J. A. (2020). Gesture networks: Introducing dynamic time warping and network analysis for the kinematic study of gesture ensembles. Discourse Processes, 57(4), 301-319.

Let's get started¶

For this tutorial, I have two videos that I would like to segment for hand gestures. They both live in the folder './videos_to_label/'.

In [1]:
import os
import glob

videofoldertoday = './videos_to_label/'
outputfolder = './output/'
In [2]:
import glob
from IPython.display import Video

# List all videos in the folder
videos = glob.glob(videofoldertoday + '*.mp4')
# Display single video
Video(videos[0], embed=True, width=200)
Out[2]:
Your browser does not support the video tag.
In [3]:
Video(videos[1], embed=True, width=200)
Out[3]:
Your browser does not support the video tag.

From the pypi package info we see that we can simply use this to get started:

from envisionhgdetector import GestureDetector

# Initialize detector
detector = GestureDetector(
    motion_threshold=0.8,    # Sensitivity to motion
    gesture_threshold=0.5,   # Confidence threshold for gestures
    min_gap_s=0.3,          # Minimum gap between gestures
    min_length_s=0.3        # Minimum gesture duration
)

# Process videos
results = detector.process_folder(
    video_folder="path/to/videos",
    output_folder="path/to/output"
)

Play around with the following parameters to optimize detection¶

The gesture annotations can be fine-tuned with the following settings:

  1. the confidence level for movement
  2. if there is movement, the confidence level for the gesture versus move category
  3. when gestures should be merged into one (an x-second gap)
  4. the shortest gesture you want to consider (otherwise removed)
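To make the last two settings concrete, here is a hypothetical sketch (not the package's actual implementation) of how gap-based merging and minimum-length filtering could work on a list of (start, end) segments in seconds:

```python
def postprocess_segments(segments, min_gap_s=0.3, min_length_s=0.3):
    """Merge segments separated by less than min_gap_s, then drop
    segments shorter than min_length_s. A sketch, not the package code."""
    merged = []
    for start, end in sorted(segments):
        if merged and start - merged[-1][1] < min_gap_s:
            # Gap too small: fuse with the previous segment
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    # Remove segments that are too short to count as a gesture
    return [(s, e) for s, e in merged if e - s >= min_length_s]

# Two gestures 0.1 s apart get merged; the isolated 0.2 s blip is dropped
print(postprocess_segments([(0.0, 1.0), (1.1, 2.0), (5.0, 5.2)]))  # [(0.0, 2.0)]
```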

STEP 1: Gesture detection¶

In [4]:
from envisionhgdetector import GestureDetector
import os

# absolute path 
videofoldertoday = os.path.abspath('./videos_to_label/')
outputfolder = os.path.abspath('./output/')

# create a detector object
detector = GestureDetector(motion_threshold=0.9, gesture_threshold=0.5, min_gap_s=0.2, min_length_s=0.3)

# just do the detection on the folder
detector.process_folder(
    input_folder=videofoldertoday,
    output_folder=outputfolder,
)
d:\Programs\Conda_packages\envs\envisiontest\lib\site-packages\tqdm\auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
  from .autonotebook import tqdm as notebook_tqdm
Importing the dtw module. When using in academic works please cite:
  T. Giorgino. Computing and Visualizing Dynamic Time Warping Alignments in R: The dtw Package.
  J. Stat. Soft., doi:10.18637/jss.v031.i07.

WARNING:tensorflow:From d:\Programs\Conda_packages\envs\envisiontest\lib\site-packages\keras\src\backend\tensorflow\core.py:216: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.

Successfully loaded weights from d:\Programs\Conda_packages\envs\envisiontest\lib\site-packages\envisionhgdetector\model\model_weights_20250210_230142.h5

Processing test_female.mp4...
Processing frames: 100%|██████████| 366/366 [00:14<00:00, 25.72it/s]
Generating labeled video...
Generating elan file...
Done processing test_female.mp4, go look in the output folder

Processing test_male.mp4...
Processing frames: 100%|██████████| 420/420 [00:15<00:00, 27.66it/s]
Generating labeled video...
Generating elan file...
Done processing test_male.mp4, go look in the output folder
Out[4]:
{'test_female.mp4': {'stats': {'average_motion': 0.7541706914093063,
   'average_gesture': 0.9675550617669758,
   'average_move': 0.03244493454177347},
  'output_path': 'd:\\Research_projects\\usingenvisionhgdetector\\output\\test_female.mp4.eaf'},
 'test_male.mp4': {'stats': {'average_motion': 0.7645088573917747,
   'average_gesture': 0.9667605282080294,
   'average_move': 0.03323947556547331},
  'output_path': 'd:\\Research_projects\\usingenvisionhgdetector\\output\\test_male.mp4.eaf'}}

The output¶

Below is a printout of the outputs we get: an ELAN file, the motion tracking features, the confidence timeseries for the video (predictions), and the gesture segments with begin and end times for the gesture labels. Finally, we also get a labeled video that shows the gesture label (based on the segment data).

In [5]:
import pandas as pd
import os
# lets list the output
outputfiles = glob.glob(outputfolder + '/*')
for file in outputfiles:
    print(os.path.basename(file))

# load one of the predictions
csvfilessegments = glob.glob(outputfolder + '/*segments.csv')
df = pd.read_csv(csvfilessegments[0])
df.head()
analysis
gesture_segments
labeled_test_female.mp4
labeled_test_male.mp4
retracked
test_female.mp4.eaf
test_female.mp4_features.npy
test_female.mp4_predictions.csv
test_female.mp4_segments.csv
test_male.mp4.eaf
test_male.mp4_features.npy
test_male.mp4_predictions.csv
test_male.mp4_segments.csv
Out[5]:
start_time end_time labelid label duration
0 0.133467 1.468133 1 Gesture 1.334667
1 2.135467 3.003000 2 Gesture 0.867533
2 4.938267 6.673300 3 Gesture 1.735033
3 7.073700 7.974600 4 Gesture 0.900900
4 8.441733 10.310267 5 Gesture 1.868533
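Since the segments table is plain CSV, it is easy to pull summary statistics out of it with pandas. A small sketch using the durations shown above (values copied from the table, so this runs without the package):

```python
import pandas as pd

# Durations from the test_female.mp4 segments table above
df = pd.DataFrame({
    "label": ["Gesture"] * 5,
    "duration": [1.334667, 0.867533, 1.735033, 0.900900, 1.868533],
})

# Count, total gesture time, and mean duration per label
summary = df.groupby("label")["duration"].agg(["count", "sum", "mean"])
print(summary.round(2))
```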

Let's assess the labeled video data¶

In [6]:
from moviepy import VideoFileClip
videoslabeled = glob.glob(outputfolder + '/*.mp4')

# need to rerender
clip = VideoFileClip(videoslabeled[0])
clip.write_videofile("./labeled_test_female.mp4")
Video("./labeled_test_female.mp4", embed=True)
MoviePy - Building video ./labeled_test_female.mp4.
MoviePy - Writing video ./labeled_test_female.mp4

                                                                          
MoviePy - Done !
MoviePy - video ready ./labeled_test_female.mp4

Out[6]:
Your browser does not support the video tag.

STEP 2: Segment the gesture videos¶

We have some further functionality in the package that helps with further analysis. Often we want to cut the videos by gesture event, as these are the moments we are interested in. Therefore, the package's utils (utilities library) has a function that takes each labeled video, generates subclips per gesture event, and also extracts the normalized features on which the gesture classifications were based. In the output folder a new folder called output/gesture_segments will be created, containing, for each video, subclips of each gesture and their features.

In [ ]:
# Step 2: Cut videos into segments
from envisionhgdetector import utils
segments = utils.cut_video_by_segments(outputfolder)

STEP 3: Retrack the gesture videos¶

If we want to do kinematic analysis on the gestures, an issue is that we are working in pixel units. MediaPipe allows for tracking body keypoints in metric (world landmark) units, which we need for our kinematic analysis. Next we generate this. We also generate the tracked videos, so we can do further quality checking.
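Why metric units matter can be shown with a toy example: with world landmarks in meters, frame-to-frame speed comes out in interpretable physical units. This is only a sketch; the array layout here is hypothetical, not the package's actual landmark output format:

```python
import numpy as np

fps = 25.0
# Hypothetical right-wrist world-landmark trajectory in meters, 5 frames of (x, y, z)
wrist = np.array([
    [0.00, 0.10, 0.30],
    [0.02, 0.12, 0.30],
    [0.05, 0.15, 0.31],
    [0.09, 0.18, 0.31],
    [0.12, 0.20, 0.32],
])

# Per-frame displacement (m) -> speed (m/s)
disp = np.linalg.norm(np.diff(wrist, axis=0), axis=1)
speed = disp * fps
print("peak speed (m/s):", speed.max().round(2))  # 1.25
```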

In [8]:
# Step 3: Create paths
gesture_segments_folder = os.path.join(outputfolder, "gesture_segments")
retracked_folder = os.path.join(outputfolder, "retracked")
analysis_folder = os.path.join(outputfolder, "analysis")

print(f"\nLooking for segments in: {gesture_segments_folder}")
if os.path.exists(gesture_segments_folder):
    segment_files = [f for f in os.listdir(gesture_segments_folder) if f.endswith('.mp4')]
    print(f"Found {len(segment_files)} segment files")
else:
    print("Gesture segments folder not found!")

# Step 3: Retrack gestures with world landmarks
print("\nStep 4: Retracking gestures...")
tracking_results = detector.retrack_gestures(
    input_folder=gesture_segments_folder,
    output_folder=retracked_folder
)
print(f"Tracking results: {tracking_results}")
Looking for segments in: d:\Research_projects\usingenvisionhgdetector\output\gesture_segments
Found 2 segment files

Step 4: Retracking gestures...
Processing test_female.mp4_segment_1_Gesture_0.13_1.47
Processing test_female.mp4_segment_2_Gesture_2.14_3.00
Processing test_female.mp4_segment_3_Gesture_4.94_6.67
Processing test_female.mp4_segment_4_Gesture_7.07_7.97
Processing test_female.mp4_segment_5_Gesture_8.44_10.31
Processing test_male.mp4_segment_1_Gesture_0.00_3.77
Processing test_male.mp4_segment_2_Gesture_4.50_5.44
Processing test_male.mp4_segment_3_Gesture_7.54_9.61
Processing test_male.mp4_segment_4_Gesture_11.24_12.05
Processing test_male.mp4_segment_5_Gesture_12.48_12.95
Successfully retracked 10 gestures
Tracking results: {'tracked_folder': 'd:\\Research_projects\\usingenvisionhgdetector\\output\\retracked\\tracked_videos', 'landmarks_folder': 'd:\\Research_projects\\usingenvisionhgdetector\\output\\retracked'}
In [ ]:
# lets show the new tracked videos
videoslabeled = glob.glob(outputfolder + '/retracked/tracked_videos/*.mp4')

# need to rerender
clip = VideoFileClip(videoslabeled[0])
clip.write_videofile("./retracked.mp4")
Video("./retracked.mp4", embed=True)
MoviePy - Building video ./retracked.mp4.
MoviePy - Writing video ./retracked.mp4

                                                             
MoviePy - Done !
MoviePy - video ready ./retracked.mp4

Out[ ]:
Your browser does not support the video tag.

Step 4: Compute DTW and create visualization¶

The next big step is to compute a dissimilarity measure between all gesture pairs using DTW. We also create a dimensionally reduced version of the distance matrix using UMAP, which we use for the visualization. This is saved as a distance matrix in the analysis folder. We also create, for each gesture, a list of kinematic features (such as max speed, max acceleration, McNeillian space, vertical height, etc.). This will also be saved in the analysis folder.
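The core idea behind DTW can be sketched with a minimal pure-numpy implementation of the classic dynamic-programming recursion (illustrative only; the package relies on the dtw module and its exact distance will differ):

```python
import numpy as np

def dtw_distance(a, b):
    """Classic dynamic-programming DTW distance between two 1-D series."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # Best of: insertion, deletion, match
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# A series compared with a time-stretched copy of itself stays maximally close:
a = np.array([0.0, 1.0, 2.0, 1.0, 0.0])
b = np.array([0.0, 1.0, 1.0, 2.0, 1.0, 0.0])  # same shape, one sample repeated
print(dtw_distance(a, b))  # 0.0: the warping absorbs the stretch
```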

In [ ]:
analysis_results = detector.analyze_dtw_kinematics(
        landmarks_folder=tracking_results["landmarks_folder"],
        output_folder=analysis_folder
    )
print(f"Analysis results: {analysis_results}")

Running the above code gives us the kinematic features and a dissimilarity DTW matrix¶

In [13]:
import pandas as pd
import os
# lets list the output
outputfiles = glob.glob(outputfolder + '/analysis/*')
for file in outputfiles:
    print(os.path.basename(file))

# load one of the predictions
csvfilessegments = glob.glob(outputfolder + '/analysis/kinematic_features.csv')
df = pd.read_csv(csvfilessegments[0])
df.head()
dtw_distances.csv
gesture_visualization.csv
kinematic_features.csv
Out[13]:
gesture_id video_id space_use_left space_use_right mcneillian_max_left mcneillian_max_right mcneillian_mode_left mcneillian_mode_right volume_both volume_right ... elbow_peak_velocity_right elbow_peak_velocity_left elbow_mean_velocity_right elbow_mean_velocity_left elbow_peak_acceleration_right elbow_peak_acceleration_left elbow_peak_deceleration_right elbow_peak_deceleration_left elbow_peak_jerk_right elbow_peak_jerk_left
0 test_female.mp4_segment_1_Gesture_0.13_1.47 test 2 1 4 2 42 2 0.018549 0.000327 ... 0.137116 0.170942 0.059420 0.090565 0.839820 1.151486 0.020207 0.063543 11.153141 12.398219
1 test_female.mp4_segment_2_Gesture_2.14_3.00 test 1 1 2 2 2 2 0.004202 0.000037 ... 0.172778 0.156491 0.063768 0.074248 1.235403 2.052149 0.058126 0.048029 21.329444 25.717586
2 test_female.mp4_segment_3_Gesture_4.94_6.67 test 1 2 2 4 2 2 0.017288 0.002753 ... 0.323431 0.337132 0.092979 0.101497 1.687188 2.167280 0.076131 0.142145 27.416663 26.251726
3 test_female.mp4_segment_4_Gesture_7.07_7.97 test 1 1 2 2 2 2 0.002064 0.000065 ... 0.253424 0.210090 0.099401 0.090704 1.765372 1.345821 0.168982 0.126163 31.342102 24.489448
4 test_female.mp4_segment_5_Gesture_8.44_10.31 test 1 2 2 4 2 2 0.005267 0.000349 ... 0.246352 0.291104 0.075126 0.085051 1.690849 1.940592 0.126281 0.038177 20.169547 21.930757

5 rows × 49 columns
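Features such as peak velocity and peak acceleration are standard derivative-based measures of a position trace. A sketch of how they could be computed (hypothetical definitions on a toy trace, not necessarily the package's exact ones):

```python
import numpy as np

fps = 25.0
t = np.arange(0, 1, 1 / fps)
# Toy 1-D wrist position trace in meters: a smooth out-and-back stroke
pos = 0.2 * np.sin(np.pi * t)

vel = np.gradient(pos, 1 / fps)  # velocity in m/s
acc = np.gradient(vel, 1 / fps)  # acceleration in m/s^2
features = {
    "peak_velocity": np.abs(vel).max(),
    "peak_acceleration": np.abs(acc).max(),
}
print({k: round(float(v), 3) for k, v in features.items()})
```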

In [14]:
detector.prepare_gesture_dashboard(
    data_folder=analysis_folder
    )
Dashboard folders set up successfully:
- Assets folder: d:\Research_projects\usingenvisionhgdetector\output\assets
- Data folder: d:\Research_projects\usingenvisionhgdetector\output\analysis
- 11 videos in assets
App dashboard copied to: d:\Research_projects\usingenvisionhgdetector\output\app.py
CSS file created at: d:\Research_projects\usingenvisionhgdetector\output\assets\styles.css
Run 'python app.py' to start the dashboard in a Python environment.

Running the app¶

Now, in your terminal, activate your environment, cd (change directory) to your output folder, and then start the app. You can then copy the address into your Chrome browser to access the running app.

conda activate envision
cd [outputfolder]
python app.py

Your app will look like this:

Concluding remarks¶

It is important to test the accuracy of your classifier against hand-labeled data that was not used to train your model. We are preparing a preprint in which we report this.